### **How to Write Fast Numerical Code**

Spring 2012 Lecture 15

Instructor: Markus Püschel

**TAs:** Georg Ofenbeck & Daniele Spampinato



Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich

# Flynn's Taxonomy

|               | Single instruction                           | Multiple instruction            |
|---------------|----------------------------------------------|---------------------------------|
| Single data   | <i>SISD</i> Uniprocessor                     | MISD                            |
| Multiple data | SIMD Vector computer Short vector extensions | MIMD<br>Multiprocessors<br>VLIW |

### SIMD Extensions and SSE

- Overview: SSE family, floating point, and x87
- SSE intrinsics
- Compiler vectorization
- This lecture and material was created together with Franz Franchetti (ECE, CMU)

### **SIMD Vector Extensions**



#### What is it?

- Extension of the ISA
- Data types and instructions for the parallel computation on short (length 2, 4, 8, ...) vectors of integers or floats
- Names: MMX, SSE, SSE2, ...

### Why do they exist?

- Useful: Many applications have the necessary fine-grain parallelism
   Then: speedup by a factor close to vector length
- Doable: Relative easy to design; chip designers have enough transistors to play with



Multimedia extension



Streaming SIMD extension

#### AVX:

Advanced vector extensions



time

# **SSE Family: Floating Point**



- Not drawn to scale
- From SSE3: Only additional instructions
- Every Core 2 has SSE3

## **Overview Floating-Point Vector ISAs**

| Vendor   | Name                                                      |     | u-way                            | Precision                            | Introduced with                                                                                  |
|----------|-----------------------------------------------------------|-----|----------------------------------|--------------------------------------|--------------------------------------------------------------------------------------------------|
| Intel    | SSE<br>SSE2<br>SSE3<br>SSSE3<br>SSE4<br>AVX               | +   | 4-way<br>2-way<br>8-way<br>4-way | single<br>double<br>single<br>double | Pentium III Pentium 4 Pentium 4 (Prescott) Core Duo Core2 Extreme (Penryn) Core i7 (Sandybridge) |
| Intel    | IPF                                                       |     | 2-way                            | single                               | Itanium                                                                                          |
| Intel    | LRB                                                       |     | 16-way<br>8-way                  | single<br>double                     | Larrabee                                                                                         |
| AMD      | 3DNow!<br>Enhanced 3DNow!<br>3DNow! Professional<br>AMD64 | +++ | 2-way<br>4-way<br>2-way          | single<br>single<br>double           | K6<br>K7<br>Athlon XP<br>Opteron                                                                 |
| Motorola | AltiVec                                                   |     | 4-way                            | single                               | MPC 7400 G4                                                                                      |
| IBM      | VMX<br>SPU                                                | +   | 4-way<br>2-way                   | single<br>double                     | PowerPC 970 G5<br>Cell BE                                                                        |
| IBM      | Double FPU                                                |     | 2-way                            | double                               | PowerPC 440 FP2                                                                                  |

Within an extension family, newer generations add features to older ones Convergence: 3DNow! Professional = 3DNow! + SSE; VMX = AltiVec;

## Core 2

- Has SSE3
- 16 SSE registers

| %xmm0 | %xmm8  |
|-------|--------|
| %xmm1 | %xmm9  |
| %xmm2 | %xmm10 |
| %xmm3 | %xmm11 |
| %xmm4 | %xmm12 |
| %xmm5 | %xmm13 |
| %xmm6 | %xmm14 |
| %xmm7 | %xmm15 |

# **SSE3** Registers



## **SSE3** Instructions: Examples

Single precision 4-way vector add: addps %xmm0 %xmm1



■ Single precision *scalar add:* addss %xmm0 %xmm1



### **SSE3 Instruction Names**







Compiler will use this for floating point

- on x86-64
- with proper flags if SSE/SSE2 is available

## x86-64 FP Code Example

#### Inner product of two vectors

- Single precision arithmetic
- Compiled: uses SSE instructions

```
ipf:
          %xmm1, %xmm1
                                # result = 0.0
  xorps
          %ecx, %ecx
  xorl
                                # i = 0
  jmp
          .L8
                                # goto middle
.L10:
                                # loop:
  movslq %ecx,%rax
                                \# icpy = i
  incl
          %ecx
                                # i++
  movss (%rsi,%rax,4), %xmm0
                                # t = y[icpy]
                                # t *= x[icpy]
  mulss (%rdi,%rax,4), %xmm0
                                # result += t
  addss %xmm0, %xmm1
.L8:
                                # middle:
  cmpl %edx, %ecx
                                # i:n
  jl
     .L10
                                # if < goto loop
  movaps %xmm1, %xmm0
                                # return result
  ret
```

## From Core 2 Manual

### Throughput, latency

|                              |      | ., . |                                         |
|------------------------------|------|------|-----------------------------------------|
| Single-precision (SP) FP MUL | 4, 1 | 4, 1 | Issue port 0; Writeback port 0          |
| Double-precision FP MUL      | 5, 1 | 5, 1 |                                         |
| FP MUL (X87)                 | 5, 2 | 5, 2 | Issue port 0; Writeback port 0          |
| FP Shuffle                   | 1, 1 | 1, 1 | FP shuffle does not handle QW           |
| DIV/SQRT                     |      |      | shuffle.                                |
|                              | 4 4  | 4.4  | C 1 1 C4111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 |

SSE based FP x87 FP

## **Summary**

- On Core 2 there are two different (unvectorized) floating points
  - x87: obsolete, is default on x86-32
  - SSE based: uses only one slot, is default on x86-64
- SIMD vector floating point instructions
  - 4-way single precision: since SSE
  - 2-way double precision: since SSE2
  - Since on Core 2 add and mult are fully pipelined (1 per cycle): possible gain 4x and 2x, respectively

# **SSE:** How to Take Advantage?



- Necessary: fine grain parallelism
- Options:
  - Use vectorized libraries (easy, not always available)
  - Compiler vectorization (today)
  - Use intrinsics (today)
  - Write assembly
- We will focus on floating point and single precision (4-way)

### **SIMD Extensions and SSE**

- Overview: SSE family, floating point, and x87
- SSE intrinsics
- Compiler vectorization

#### **References:**

Intel icc manual (currently 12.0) → Intrinsics reference
<a href="http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/index.htm">http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/index.htm</a>

Visual Studio Manual (also: paste the intrinsic into Google)

http://msdn.microsoft.com/de-de/library/26td21ds.aspx

# **SSE Family: Floating Point**



- Not drawn to scale
- From SSE3: Only additional instructions
- Every Core 2 has SSE3

# **SSE Family Intrinsics**

- Assembly coded C functions
- Expanded inline upon compilation: no overhead
- Like writing assembly inside C
- Floating point:
  - Intrinsics for math functions: log, sin, ...
  - Intrinsics for SSE
- Our introduction is based on icc
  - Most intrinsics work with gcc and Visual Studio (VS)
  - Some language extensions are icc (or even VS) specific

## **Header files**

SSE: xmmintrin.h

SSE2: emmintrin.h

SSE3: pmmintrin.h

SSSE3: tmmintrin.h

SSE4: smmintrin.h and nmmintrin.h

or ia32intrin.h

## **Visual Conventions We Will Use**

Memory
increasing address
memory

### Registers

Before (and common)



Now we will use



# **SSE Intrinsics (Focus Floating Point)**

#### Data types

```
__m128 f; // = {float f0, f1, f2, f3}
__m128d d; // = {double d0, d1}
__m128i i; // 16 8-bit, 8 16-bit, 4 32-bit, or 2 64-bit ints
```



# **SSE Intrinsics (Focus Floating Point)**

#### Instructions

- Naming convention: \_mm\_<intrin\_op>\_<suffix>
- Example:

```
// a is 16-byte aligned
float a[4] = {1.0, 2.0, 3.0, 4.0};
__m128 t = _mm_load_ps(a);
```

p: packed

s: single precision

Same result as

```
_{m128} t = _{mm_set_ps}(4.0, 3.0, 2.0, 1.0)
```

### **SSE Intrinsics**

Native instructions (one-to-one with assembly)

```
_mm_load_ps()
_mm_add_ps()
_mm_mul_ps()
```

Multi instructions (map to several assembly instructions)

```
_mm_set_ps()
_mm_set1_ps()
```

•••

Macros and helpers

```
_MM_TRANSPOSE4_PS()
_MM_SHUFFLE()
```

•••

## What Are the Main Issues?

- Alignment is important (128 bit = 16 byte)
- You need to code explicit loads and stores
- Overhead through shuffles

## **SSE Intrinsics**

- Load and store
- Constants
- Arithmetic
- Comparison
- Conversion
- Shuffles

## **Loads and Stores**

| Intrinsic Name | Operation                                          | Corresponding SSE Instructions |
|----------------|----------------------------------------------------|--------------------------------|
| _mm_loadh_pi   | Load high                                          | MOVHPS reg, mem                |
| _mm_loadl_pi   | Load low                                           | MOVLPS reg, mem                |
| _mm_load_ss    | Load the low value and clear the three high values | MOVSS                          |
| _mm_load1_ps   | Load one value into all four words                 | MOVSS + Shuffling              |
| _mm_load_ps    | Load four values, address aligned                  | MOVAPS                         |
| _mm_loadu_ps   | Load four values, address unaligned                | MOVUPS                         |
| _mm_loadr_ps   | Load four values in reverse                        | MOVAPS + Shuffling             |

| Intrinsic Name | Operation                                         | Corresponding SSE Instruction |
|----------------|---------------------------------------------------|-------------------------------|
| _mm_set_ss     | Set the low value and clear the three high values | Composite                     |
| _mm_set1_ps    | Set all four words with the same value            | Composite                     |
| _mm_set_ps     | Set four values, address aligned                  | Composite                     |
| _mm_setr_ps    | Set four values, in reverse order                 | Composite                     |
| _mm_setzero_ps | Clear all four values                             | Composite                     |

## **Loads and Stores**



```
a = _mm_load_ps(p); // p 16-byte aligned
```

## **How to Align**

\_\_m128, \_\_m128d, \_\_m128i are 16-byte aligned

Arrays:

```
__declspec(align(16)) float g[4];
```

### Dynamic allocation

- mm malloc() and mm free()
- Write your own malloc that returns 16-byte aligned addresses
- Some malloc's already guarantee 16-byte alignment

### **Loads and Stores**



## **Loads and Stores**



a = \_mm\_load\_ss(p); // p any alignment

# **Stores Analogous to Loads**

| Intrinsic Name | Operation                                                  | Corresponding SSE Instruction |
|----------------|------------------------------------------------------------|-------------------------------|
| _mm_storeh_pi  | Store high                                                 | MOVHPS mem, reg               |
| _mm_storel_pi  | Store low                                                  | MOVLPS mem, reg               |
| _mm_store_ss   | Store the low value                                        | MOVSS                         |
| _mm_store1_ps  | Store the low value across all four words, address aligned | Shuffling + MOVSS             |
| _mm_store_ps   | Store four values, address aligned                         | MOVAPS                        |
| _mm_storeu_ps  | Store four values, address unaligned                       | MOVUPS                        |
| _mm_storer_ps  | Store four values, in reverse order                        | MOVAPS + Shuffling            |

### **Constants**

## **Arithmetic**

#### SSE

| Intrinsic Name | Operation               | Corresponding SSE Instruction |
|----------------|-------------------------|-------------------------------|
| _mm_add_ss     | Addition                | ADDSS                         |
| _mm_add_ps     | Addition                | ADDPS                         |
| _mm_sub_ss     | Subtraction             | SUBSS                         |
| _mm_sub_ps     | Subtraction             | SUBPS                         |
| _mm_mul_ss     | Multiplication          | MULSS                         |
| _mm_mul_ps     | Multiplication          | MULPS                         |
| _mm_div_ss     | Division                | DIVSS                         |
| _mm_div_ps     | Division                | DIVPS                         |
| _mm_sqrt_ss    | Squared Root            | SQRTSS                        |
| _mm_sqrt_ps    | Squared Root            | SQRTPS                        |
| _mm_rcp_ss     | Reciprocal              | RCPSS                         |
| _mm_rcp_ps     | Reciprocal              | RCPPS                         |
| _mm_rsqrt_ss   | Reciprocal Squared Root | RSQRTSS                       |
| _mm_rsqrt_ps   | Reciprocal Squared Root | RSQRTPS                       |
| _mm_min_ss     | Computes Minimum        | MINSS                         |
| _mm_min_ps     | Computes Minimum        | MINPS                         |
| _mm_max_ss     | Computes Maximum        | MAXSS                         |
| _mm_max_ps     | Computes Maximum        | MAXPS                         |

#### SSE3

| Intrinsic Name | Operation        | Corresponding SSE3 Instruction |
|----------------|------------------|--------------------------------|
| _mm_addsub_ps  | Subtract and add | ADDSUBPS                       |
| _mm_hadd_ps    | Add              | HADDPS                         |
| _mm_hsub_ps    | Subtracts        | HSUBPS                         |

#### SSE4

| Intrinsic | Operation                    | Corresponding SSE4 Instruction |
|-----------|------------------------------|--------------------------------|
| _mm_dp_ps | Single precision dot product | DPPS                           |

## **Arithmetic**



### analogous:

$$c = _{mm\_sub\_ps}(a, b);$$

$$c = _{mm_mul_ps(a, b)};$$

## **Example**

```
void addindex(float *x, int n) {
  for (int i = 0; i < n; i++)
    x[i] = x[i] + i;
}</pre>
```

How does the code style differ from scalar code? *Intrinsics force scalar replacement!* 

## **Arithmetic**





$$c = _{mm_max_ps(a, b)};$$





#### analogous:

# **Example**

```
// n is even
void lp(float *x, float *y, int n) {
  for (int i = 0; i < n/2; i++)
    y[i] = (x[2*i] + x[2*i+1])/2;
}</pre>
```

**(SSE4)** Computes the pointwise product of a and b and writes a selected sum of the resulting numbers into selected elements of c; the others are set to zero. The selections are encoded in the mask.

**Example:** mask = 117 = 01110101



# **Comparisons**

| Late Control Name | 0                            | C               |
|-------------------|------------------------------|-----------------|
| Intrinsic Name    | Operation                    | Corresponding   |
|                   |                              | SSE Instruction |
| _mm_cmpeq_ss      | Equal                        | CMPEQSS         |
| _mm_cmpeq_ps      | Equal                        | CMPEQPS         |
| _mm_cmplt_ss      | Less Than                    | CMPLTSS         |
| _mm_cmplt_ps      | Less Than                    | CMPLTPS         |
| _mm_cmple_ss      | Less Than or Equal           | CMPLESS         |
| _mm_cmple_ps      | Less Than or Equal           | CMPLEPS         |
| _mm_cmpgt_ss      | Greater Than                 | CMPLTSS         |
| _mm_cmpgt_ps      | Greater Than                 | CMPLTPS         |
| _mm_cmpge_ss      | Greater Than or Equal        | CMPLESS         |
| _mm_cmpge_ps      | Greater Than or Equal        | CMPLEPS         |
| _mm_cmpneq_ss     | Not Equal                    | CMPNEQSS        |
| _mm_cmpneq_ps     | Not Equal                    | CMPNEQPS        |
| _mm_cmpnlt_ss     | Not Less Than                | CMPNLTSS        |
| _mm_cmpnlt_ps     | Not Less Than                | CMPNLTPS        |
| _mm_cmpnle_ss     | Not Less Than or Equal       | CMPNLESS        |
| _mm_cmpnle_ps     | Not Less Than or Equal       | CMPNLEPS        |
| _mm_cmpngt_ss     | Not Greater Than             | CMPNLTSS        |
| _mm_cmpngt_ps     | Not Greater Than             | CMPNLTPS        |
| _mm_cmpnge_ss     | Not Greater Than or<br>Equal | CMPNLESS        |
| _mm_cmpnge_ps     | Not Greater Than or<br>Equal | CMPNLEPS        |

| Intrinsic Name  | Operation             | Corresponding   |
|-----------------|-----------------------|-----------------|
|                 |                       | SSE Instruction |
| _mm_cmpord_ss   | Ordered               | CMPORDSS        |
| _mm_cmpord_ps   | Ordered               | CMPORDPS        |
| _mm_cmpunord_ss | Unordered             | CMPUNORDSS      |
| _mm_cmpunord_ps | Unordered             | CMPUNORDPS      |
| _mm_comieq_ss   | Equal                 | COMISS          |
| _mm_comilt_ss   | Less Than             | COMISS          |
| _mm_comile_ss   | Less Than or Equal    | COMISS          |
| _mm_comigt_ss   | Greater Than          | COMISS          |
| _mm_comige_ss   | Greater Than or Equal | COMISS          |
| _mm_comineq_ss  | Not Equal             | COMISS          |
| _mm_ucomieq_ss  | Equal                 | UCOMISS         |
| _mm_ucomilt_ss  | Less Than             | UCOMISS         |
| _mm_ucomile_ss  | Less Than or Equal    | UCOMISS         |
| _mm_ucomigt_ss  | Greater Than          | UCOMISS         |
| _mm_ucomige_ss  | Greater Than or Equal | UCOMISS         |
| _mm_ucomineq_ss | Not Equal             | UCOMISS         |

# **Comparisons**



#### analogous:

$$c = _{mm\_cmplt\_ps(a, b)};$$

etc.

#### Each field:

0xffffffff if true 0x0 if false

Return type: \_\_m128

# **Conversion**

| Intrinsic Name   | Operation                         | Corresponding SSE Instruction |
|------------------|-----------------------------------|-------------------------------|
| _mm_cvtss_si32   | Convert to 32-bit integer         | CVTSS2SI                      |
| _mm_cvtss_si64*  | Convert to 64-bit integer         | CVTSS2SI                      |
| _mm_cvtps_pi32   | Convert to two 32-bit integers    | CVTPS2PI                      |
| _mm_cvttss_si32  | Convert to 32-bit integer         | CVTTSS2SI                     |
| _mm_cvttss_si64* | Convert to 64-bit integer         | CVTTSS2SI                     |
| _mm_cvttps_pi32  | Convert to two 32-bit integers    | CVTTPS2PI                     |
| _mm_cvtsi32_ss   | Convert from 32-bit integer       | CVTSI2SS                      |
| _mm_cvtsi64_ss*  | Convert from 64-bit integer       | CVTSI2SS                      |
| _mm_cvtpi32_ps   | Convert from two 32-bit integers  | CVTTPI2PS                     |
| _mm_cvtpi16_ps   | Convert from four 16-bit integers | composite                     |
| _mm_cvtpu16_ps   | Convert from four 16-bit integers | composite                     |
| _mm_cvtpi8_ps    | Convert from four 8-bit integers  | composite                     |
| _mm_cvtpu8_ps    | Convert from four 8-bit integers  | composite                     |
| _mm_cvtpi32x2_ps | Convert from four 32-bit integers | composite                     |
| _mm_cvtps_pi16   | Convert to four 16-bit integers   | composite                     |
| _mm_cvtps_pi8    | Convert to four 8-bit integers    | composite                     |
| _mm_cvtss_f32    | Extract                           | composite                     |

## **Conversion**

```
float _mm_cvtss_f32(__m128 a)
```



```
float f;
f = _mm_cvtss_f32(a);
```

#### Cast



Reinterprets the four single precision floating point values in a as four 32-bit integers, and vice versa.

No conversion is performed. Does not map to any assembly instructions.

Makes integer shuffle instructions usable for floating point.

#### SSE

| JUL             |                                         |                               |
|-----------------|-----------------------------------------|-------------------------------|
| Intrinsic Name  | Operation                               | Corresponding SSE Instruction |
| _mm_shuffle_ps  | Shuffle                                 | SHUFPS                        |
| _mm_unpackhi_ps | Unpack High                             | UNPCKHPS                      |
| _mm_unpacklo_ps | Unpack Low                              | UNPCKLPS                      |
| _mm_move_ss     | Set low word, pass in three high values | MOVSS                         |
| _mm_movehl_ps   | Move High to Low                        | MOVHLPS                       |
| _mm_movelh_ps   | Move Low to High                        | MOVLHPS                       |
| _mm_movemask_ps | Create four-bit mask                    | MOVMSKPS                      |

#### SSE3

| Intrinsic Name  | Operation  | Corresponding SSE3 Instruction |
|-----------------|------------|--------------------------------|
| _mm_movehdup_ps | Duplicates | MOVSHDUP                       |
| _mm_moveldup_ps | Duplicates | MOVSLDUP                       |

#### SSSE3

| Intrinsic Name   | Operation | Corresponding SSSE3 Instruction |
|------------------|-----------|---------------------------------|
| _mm_shuffle_epi8 | Shuffle   | PSHUFB                          |
| _mm_alignr_epi8  | Shift     | PALIGNR                         |

#### SSE4

| Intrinsic Syntax                                     | Operation                                                                                   | Corresponding SSE4 Instruction |
|------------------------------------------------------|---------------------------------------------------------------------------------------------|--------------------------------|
| m128 _mm_blend_ps(m128 v1,m128 v2, const int mask)   | Selects float single precision data from 2 sources using constant mask                      | BLENDPS                        |
| m128 _mm_blendv_ps(m128 v1,m128 v2,m128 v3)          | Selects float single precision data from 2 sources using variable mask                      | BLENDVPS                       |
| m128 _mm_insert_ps(m128 dst,m128 src, const int ndx) | Insert single precision float into packed single precision array element selected by index. | INSERTPS                       |
| int _mm_extract_ps(m128 src, const int ndx)          | Extract single precision float from packed single precision array selected by index.        | EXTRACTPS                      |





helper macro to create mask

```
any element of a of b
```

# **Example: Loading 4 Real Numbers from Arbitrary Memory Locations**



7 instructions, this is the right way (before SSE4)

## **Code For Previous Slide**

# Example: Loading 4 Real Numbers from Arbitrary Memory Locations (cont'd)

- Whenever possible avoid the previous situation
- Restructure algorithm and use the aligned \_mm\_load\_ps()
- Other possibility (but also yields 7 instructions and less safe)

```
__m128 vf;
vf = _mm_set_ps(*p3, *p2, *p1, *p0);
```

- SSE4: \_mm\_insert\_epi32 together with \_mm\_castsi128\_ps
  - Not clear whether better

# Example: Loading 4 Real Numbers from Arbitrary Memory Locations (cont'd)

Do not do this (why?):

```
__declspec(align(16)) float g[4];
__m128 vf;

g[0] = *p0;
g[1] = *p1;
g[2] = *p2;
g[3] = *p3;
vf = __mm_load_ps(g);
```

# **Example: Storing 4 Real Numbers to Arbitrary Memory Locations**



7 instructions, shorter critical path (before SSE4)

```
__m128i _mm_alignr_epi8(__m128i a, __m128i b, const int n)
```

Concatenate a and b and extract byte-aligned result shifted to the right by n bytes

Example: View \_\_m128i as 4 32-bit ints; n = 12



How to use this with floating point vectors?

```
Use with _mm_castsi128_ps !
```

## **Example**

```
void shift(float *x, float *y, int n) {
  for (int i = 0; i < n-1; i++)
    y[i] = x[i+1];
  y[n-1] = 0;
}</pre>
```

**No: bandwidth bound** 

```
#include <ia32intrin.h>
// n a multiple of 4, x, y are 16-byte aligned
void shift vec(float *x, float *y, int n) {
 m128 f;
 __m128i i1, i2, i3;
 i1 = mm castps si128(mm load ps(x));
                                           // load first 4 floats and cast to int
for (int i = 0; i < n-8; i = i + 4) {
   i2 = _mm_castps_si128(_mm_load_ps(x+4+i)); // load next 4 floats and cast to int
   f = _mm_castsi128_ps(_mm_alignr_epi8(i2,i1,4)); // shift and extract and cast back
   mm store ps(y+i,f);
                                           // store it
   i1 = i2;
                                            // make 2nd element 1st
 // we are at the last 4
 _mm_store_ps(y+n-4,f);
                                            // store it
```

# **Vectorization**



Picture: www.druckundbestell.de

Result is filled in each position by any element of a or with 0, as specified by mask

Example: View \_\_m128i as 4 32-bit ints



Use with \_mm\_castsi128\_ps to do the same for floating point

**(SSE4)** Result is filled in each position by an element of a or b in the same position as specified by mask

**Example:** mask = 2 = 0010



\_MM\_TRANSPOSE4\_PS(row0, row1, row2, row3)

**Macro for 4 x 4 matrix transposition:** The arguments row0,..., row3 are \_\_m128 values each containing a row of a 4 x 4 matrix. After execution, row0, .., row 3 contain the columns of that matrix.



In SSE: 8 shuffles (4 mm unpacklo ps, 4 mm unpackhi ps)

# **Vectorization With Intrinsics: Key Points**

- Use aligned loads and stores
- Minimize overhead (shuffle instructions)
  - = maximize vectorization efficiency

Definition: Vectorization efficiency

Op count of scalar (unvectorized) code

Op count of vectorized code

includes shuffles
does not include loads/stores

- *Ideally:* Efficiency = v for v-way vector instructions
  - assumes no vector instruction does more than v scalar ops
  - assumes every vector instruction has the same cost

# Vectorization Efficiency: Example I

```
// n is even
void lp(float *x, float *y, int n) {
  for (int i = 0; i < n/2; i++)
    y[i] = (x[2*i] + x[2*i+1])/2;
}</pre>
```

# **Vectorization Efficiency: Example 2**

- 4 x 4 matrix-vector multiplication
- Blackboard



## **SIMD Extensions and SSE**

- Overview: SSE family, floating point, and x87
- SSE intrinsics
- Compiler vectorization

#### **References:**

Intel icc manual (currently 12.0)  $\rightarrow$  Creating parallel applications

→ Automatic vectorization

http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/index.htm

# **Compiler Vectorization**

- Compiler flags
- Aliasing
- Proper code style
- Alignment

# Compiler Flags (icc 12.0)

| Linux* OS and Mac OS* X | Windows* OS       | Description                                                                                                                                                                                                                                           |
|-------------------------|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -vec<br>-no-vec         | /Qvec<br>/Qvec-   | Enables or disables vectorization and transformations enabled for vectorization. Vectorization is enabled by default. To disable, use -no-vec (Linux* and MacOS* X) or /Qvec- (Windows*) option. Supported on IA-32 and Intel® 64 architectures only. |
| -vec-report             | /Qvec-report      | Controls the diagnostic messages from the vectorizer. See <u>Vectorization Report</u> .                                                                                                                                                               |
| -simd<br>-no-simd       | /Qsimd<br>/Qsimd- | Controls user-mandated (SIMD) vectorization. User-mandated (SIMD) vectorization is enabled by default. Use the -no-simd (Linux* or MacOS* X) or /Qsimd-(Windows*) option to disable SIMD transformations for vectorization.                           |

#### **Architecture flags:**

**Linux:** -xHost ⊃ -mHost

Windows: /QxHost ⊃ /Qarch:Host

Host in {SSE2, SSE3, SSSE3, SSE4.1, SSE4.2}

Default: -mSSE2, /Qarch:SSE2

# **How Do I Know the Compiler Vectorized?**

- vec-report (previous slide)
- Look at assembly: mulps, addps, xxxps
- Generate assembly with source code annotation:
  - Visual Studio + icc: /Fas
  - icc on Linux/Mac: -S

# **Example**

```
void myadd(float *a, float *b, const int n) {
  for (int i = 0; i< n; i++)
    a[i] = a[i] + b[i];
}</pre>
```

#### unvectorized: /Qvec-

#### vectorized:

```
<more>
       a[i] = a[i] + b[i];
      xmm0, DWORD PTR [rcx+r11*4]
movss
      xmm0, DWORD PTR [rdx+r11*4]
addss
                                                          why this?
          DWORD PTR [rcx+r11*4], xmm0
movss
          xmm0, XMMWORD PTR [rdx+r10*4]
movups
          xmm1, XMMWORD PTR [16+rdx+r10*4]
movups
                                                         why everything twice?
addps
          xmm0, XMMWORD PTR [rcx+r10*4]
                                                         why movups and movaps?
          xmm1, XMMWORD PTR [16+rcx+r10*4]
addps
          XMMWORD PTR [rcx+r10*4], xmm0
movaps
          XMMWORD PTR [16+rcx+r10*4], xmm1
movaps
                                                            unaligned
                                                                          aligned
<more>
```

# **Aliasing**

```
for (i = 0; i < n; i++)
a[i] = a[i] + b[i];</pre>
```

Cannot be vectorized in a straightforward way due to potential aliasing.

However, in this case compiler can insert runtime check:

```
if (a + n < b || b + n < a)
    /* vectorized loop */
    ...
else
    /* serial loop */
    ...</pre>
```

# **Removing Aliasing**

- Globally with compiler flag:
  - -fno-alias, /Oa
  - -fargument-noalias, /Qalias-args- (function arguments only)
- For one loop: pragma

```
void add(float *a, float *b, int n) {
    #pragma ivdep
    for (i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}</pre>
```

■ For specific arrays: restrict (needs compiler flag -restrict, /Qrestrict)

```
void add(float *restrict a, float *restrict b, int n) {
for (i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}</pre>
```

# **Proper Code Style**

- Use countable loops = number of iterations known at runtime
  - Number of iterations is a: constant loop invariant term linear function of outermost loop indices

#### Countable or not?

```
for (i = 0; i < n; i++)
a[i] = a[i] + b[i];</pre>
```

```
void vsum(float *a, float *b, float *c) {
  int i = 0;

while (a[i] > 0.0) {
    a[i] = b[i] * c[i];
    i++;
  }
}
```

# **Proper Code Style**

- Use arrays, structs of arrays, not arrays of structs
- Ideally: unit stride access in innermost loop

```
void mmm1(float *a, float *b, float *c) {
  int N = 100;
  int i, j, k;

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}</pre>
```

```
void mmm2(float *a, float *b, float *c) {
  int N = 100;
  int i, j, k;

for (i = 0; i < N; i++)
    for (k = 0; k < N; k++)
    for (j = 0; j < N; j++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}</pre>
```

# Alignment

```
float x[1024];
int i;

for (i = 0; i < 1024; i++)
   x[i] = 1;</pre>
```

Cannot be vectorized in a straightforward way since x may not be aligned

However, the compiler can peel the loop to extract aligned part:

```
float x[1024];
int i;

peel = x & 0x0f; /* x mod 16 */
if (peel != 0) {
  peel = 16 - peel;
  /* initial segment */
  for (i = 0; i < peel; i++)
    x[i] = 1;
}
/* 16-byte aligned access */
for (i = peel; i < 1024; i++)
  x[i] = 1;</pre>
```

# **Ensuring Alignment**

- Align arrays to 16-byte boundaries (see earlier discussion)
- If compiler cannot analyze:
  - Use pragma for loops

```
float x[1024];
int i;

#pragma vector aligned
for (i = 0; i < 1024; i++)
    x[i] = 1;</pre>
```

For specific arrays:
 \_\_assume\_aligned(a, 16);

```
void myadd(float *a, float *b, const int n) {
  for (int i = 0; i< n; i++)
    a[i] = a[i] + b[i];
}</pre>
```

#### Assume:

- No aliasing information
- No alignment information

Can compiler vectorize?

#### Yes: Through versioning



# **Compiler Vectorization**

Read manual